Here’s how to scrape one page of the IEA policy database, step by step.
The site https://www.iea.org/policies looks like this:
It’s got a table of policies, but the table doesn’t contain all the information we want about those policies.
To get that information, we will have to follow the links to those policies, and extract it from that page.
To get and parse (understand) the page, we can use the read_html() function from rvest.
library(rvest)
html <- read_html("https://www.iea.org/policies")
html
## {html_document}
## <html dir="ltr" lang="en-GB" class="no-js page-all-policies ">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <!-- Google Tag Manager (noscript) -->\n <noscript>\n ...
The variable html now holds the parsed HTML of the page.
We can see the links on the page, and if we open the developer tools window, we can find the corresponding element in the HTML.
Highlighted, we see an element which looks like this:
<a class="m-policy-listing-item__link" href="/policies/11663-fuel-economy-standards-on-light-duty-vehicles">
Fuel Economy Standards on Light-Duty Vehicles
</a>
We can search for the links using the html_elements()
function from rvest. We search using CSS selectors, which are one
way to select elements from a web page. This is a
nice explanation of how these work, and you can also try these
interactively here (you can copy
your HTML by right-clicking on the body tag of your website, right at the
top of the elements tab, and selecting copy outer HTML).
In our case we can see that our links are elements of the type
a (because a is the first word after the <
character). They also have the class “m-policy-listing-item__link” (pay
attention to the number of underscores!). This means our selector looks
like this "a.m-policy-listing-item__link", which means
“select all a elements that have the class
m-policy-listing-item__link”.
links <- html %>% html_elements("a.m-policy-listing-item__link")
links
## {xml_nodeset (30)}
## [1] <a class="m-policy-listing-item__link" href="/policies/11663-fuel-econom ...
## [2] <a class="m-policy-listing-item__link" href="/policies/12654-emissions-l ...
## [3] <a class="m-policy-listing-item__link" href="/policies/8506-gas-boilers- ...
## [4] <a class="m-policy-listing-item__link" href="/policies/3124-local-govern ...
## [5] <a class="m-policy-listing-item__link" href="/policies/12046-decommissio ...
## [6] <a class="m-policy-listing-item__link" href="/policies/8401-enhancements ...
## [7] <a class="m-policy-listing-item__link" href="/policies/12197-heavy-goods ...
## [8] <a class="m-policy-listing-item__link" href="/policies/11497-proposals-f ...
## [9] <a class="m-policy-listing-item__link" href="/policies/13139-resolution- ...
## [10] <a class="m-policy-listing-item__link" href="/policies/11456-updated-mep ...
## [11] <a class="m-policy-listing-item__link" href="/policies/15028-france-2030 ...
## [12] <a class="m-policy-listing-item__link" href="/policies/15026-france-2030 ...
## [13] <a class="m-policy-listing-item__link" href="/policies/15025-france-2030 ...
## [14] <a class="m-policy-listing-item__link" href="/policies/14279-france-2030 ...
## [15] <a class="m-policy-listing-item__link" href="/policies/15029-france-2030 ...
## [16] <a class="m-policy-listing-item__link" href="/policies/14465-france-2030 ...
## [17] <a class="m-policy-listing-item__link" href="/policies/15027-france-2030 ...
## [18] <a class="m-policy-listing-item__link" href="/policies/15780-inner-mongo ...
## [19] <a class="m-policy-listing-item__link" href="/policies/13751-2022-eu-bud ...
## [20] <a class="m-policy-listing-item__link" href="/policies/13231-2022-2033-n ...
## ...
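To experiment with selectors without touching the live site, one option is to build a small page by hand. The sketch below uses rvest's minimal_html() on made-up HTML (the policy names and the "other-link" class are invented for illustration) and compares a selector on the element type alone with a selector on element type plus class.

```r
library(rvest)

# A tiny hand-written page; the content is invented for illustration
toy <- minimal_html('
  <a class="m-policy-listing-item__link" href="/policies/1-first-policy">First policy</a>
  <a class="m-policy-listing-item__link" href="/policies/2-second-policy">Second policy</a>
  <a class="other-link" href="/about">About</a>
')

# Every <a> element on the page
all_links <- toy %>% html_elements("a")

# Only the <a> elements that carry the policy-link class
policy_links <- toy %>% html_elements("a.m-policy-listing-item__link")

length(all_links)     # 3
length(policy_links)  # 2
```

Adding the class to the selector is what filters out navigation and other unrelated links.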
Now, these links all have a destination they are pointing to. We need
this destination! It is stored in the attribute “href”. Remember that
attributes are key-value pairs on either side of an = sign. In
our example, “href” is the key, and
“/policies/11663-fuel-economy-standards-on-light-duty-vehicles” is the
value. We can extract attributes using html_attr(),
specifying the key of the attribute we want to
extract.
link_destinations <- links %>% html_attr("href")
link_destinations
## [1] "/policies/11663-fuel-economy-standards-on-light-duty-vehicles"
## [2] "/policies/12654-emissions-limit-on-the-capacity-market-regulations"
## [3] "/policies/8506-gas-boilers-replacement-by-low-carbon-heating-systems"
## [4] "/policies/3124-local-government-fleet-renewal-mandate"
## [5] "/policies/12046-decommissioning-fossil-fuel-power-plants"
## [6] "/policies/8401-enhancements-to-minimum-energy-performance-standards-meps"
## [7] "/policies/12197-heavy-goods-vehicle-charge"
## [8] "/policies/11497-proposals-for-location-of-wind-power-turbines"
## [9] "/policies/13139-resolution-407152019-wholesale-energy-market-with-res-in-2023"
## [10] "/policies/11456-updated-meps-central-air-conditioners-and-heat-pumps"
## [11] "/policies/15028-france-2030-investment-plan-small-modular-reactor-investment"
## [12] "/policies/15026-france-2030-investment-plan-critical-minerals-investment"
## [13] "/policies/15025-france-2030-investment-plan-investment-in-renewable-energy-innovation"
## [14] "/policies/14279-france-2030-investment-plan"
## [15] "/policies/15029-france-2030-investment-plan-heavy-industry-decarbonisation-investment"
## [16] "/policies/14465-france-2030-investment-plan-hydrogen-sector-funding"
## [17] "/policies/15027-france-2030-investment-plan-clean-transport-investment"
## [18] "/policies/15780-inner-mongolia-coal-industry-development-14th-five-year-plan-coalbed-methane-development-and-utilization-supporting-scheme"
## [19] "/policies/13751-2022-eu-budget"
## [20] "/policies/13231-2022-2033-national-transport-plan-railway"
## [21] "/policies/13238-2022-2033-national-transport-plan-urban-growth-agreement"
## [22] "/policies/13239-2022-2033-national-transport-plan-water-transport"
## [23] "/policies/13240-2022-2033-national-transport-plan-new-airports"
## [24] "/policies/14813-2022-round-of-subsidies-for-e-mobility"
## [25] "/policies/14885-aud-7-million-grants-on-hydrogen-fuelled-cremations-fuel-cell-buses"
## [26] "/policies/14890-aid-for-municipal-infrastructure-for-just-transition"
## [27] "/policies/14499-budget-agreement-2022-green-initiatives-and-sustainable-energy-portfolio"
## [28] "/policies/15589-california-governors-budget-summary-lithium-valley-development"
## [29] "/policies/14212-carbon-neutrality-and-green-growth-act-for-the-climate-change"
## [30] "/policies/11685-clean-fuel-standard"
Applying this to our list of links gives us a character vector of link destinations.
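As a small aside, html_attr() returns NA for an element that lacks the requested attribute. The sketch below shows this on a hand-written page (the HTML is invented for illustration), along with one way to drop such entries before following the links.

```r
library(rvest)

# Two links: one with an href attribute and one without (made-up content)
toy <- minimal_html('
  <a href="/policies/1-first-policy">First policy</a>
  <a>No destination</a>
')

toy_links <- toy %>% html_elements("a")
hrefs <- toy_links %>% html_attr("href")

hrefs                 # the second entry is NA because that <a> has no href
hrefs[!is.na(hrefs)]  # keep only the links that have a destination
```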
We are going to want to process all of these links, but let’s start
with the first one. Putting a number n in square brackets
after a vector gives us the nth item of that vector.
link <- link_destinations[1]
link
## [1] "/policies/11663-fuel-economy-standards-on-light-duty-vehicles"
Now we want to visit this page, but the link is a relative link: it only
makes sense relative to the site we found it on. We can build the whole link
by putting the site’s address in front of it with the paste0() function.
absolute_link <- paste0("https://www.iea.org", link)
absolute_link
## [1] "https://www.iea.org/policies/11663-fuel-economy-standards-on-light-duty-vehicles"
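An alternative to pasting strings together is url_absolute() from the xml2 package (which rvest builds on). It resolves a relative link against the address of the page it was found on. This is a sketch of that alternative, not what the rest of this walkthrough uses; the variable name base is introduced here for illustration.

```r
library(xml2)

link <- "/policies/11663-fuel-economy-standards-on-light-duty-vehicles"
base <- "https://www.iea.org/policies"  # the page the link was found on

# Resolve the relative link against the page it came from
absolute_link <- url_absolute(link, base)
absolute_link
```

This also copes with links that are already absolute, which paste0() would turn into a broken address.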
If we put this link into our browser, it looks like this
Let’s start by extracting the text. If
we hover over the text, we can see it’s in the first
<p> (paragraph) element of a <div>
which has the classes “m-block__content f-rte f-rte--block”.
We can select this with the CSS selector “div.m-block__content p”,
which means “give me all <p> elements which are
inside a <div> element that has the class
m-block__content”. If we use html_element() as
opposed to html_elements() then we will just get the first
item that matches our search. In this case, that’s the one we want.
# First we read the html from the link we came up with earlier
link_html <- read_html(absolute_link)
ps <- link_html %>% html_elements("div.m-block__content p")
print(ps)
## {xml_nodeset (2)}
## [1] <p>Japan sets and periodically updates fuel economy standards on cars, va ...
## [2] <p>\n Want to know more about this pol ...
p <- link_html %>% html_element("div.m-block__content p")
p
## {html_node}
## <p>
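The difference between the two functions can be checked on a small hand-written page. In this sketch the HTML is invented for illustration; it mirrors the structure above, with two paragraphs inside the content <div>.

```r
library(rvest)

# Made-up page with the same structure as the policy page
toy <- minimal_html('
  <div class="m-block__content">
    <p>The policy description.</p>
    <p>Want to know more?</p>
  </div>
')

# html_elements() returns every match as a node set
ps <- toy %>% html_elements("div.m-block__content p")
length(ps)  # 2

# html_element() returns only the first match, as a single node
p <- toy %>% html_element("div.m-block__content p")
html_text2(p)

# A selector that matches nothing gives an empty node set
nothing <- toy %>% html_elements("div.no-such-class p")
length(nothing)  # 0
```

Because html_element() always takes the first match, it only gives the description when the description really is the first paragraph in the block.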
Now, if we apply html_text2(), or
html_text() (which does less formatting), we get the text
that is in that paragraph!
text <- p %>% html_text2()
text
## [1] "Japan sets and periodically updates fuel economy standards on cars, vans and trucks under its Top Runner Program. The efficiency requirements are based on the most fuel-efficient vehicles on the market, and manufacturers and importers of these vehicles are required to meet these vehicle efficiency standards on a corporate average basis. The fuel efficiency of passenger vehicles has improved by 96% over the past two decades. Japan has announced new fuel economy standards on light duty vehicles, aiming at improving fuel efficiency by 32% by 2030, compared with 2016 levels. The scope has been expanded to cover the efficiency of electric vehicles and plug-in hybrids, and new standards take into account the energy consumption of the fuel production (gasoline and electricity), the so-called 'well-to-wheel' approach."
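To see what the extra formatting in html_text2() amounts to, the sketch below runs both functions on a hand-written paragraph with messy whitespace and a <br> tag (the HTML is invented for illustration).

```r
library(rvest)

# A paragraph with irregular spacing, a line break in the source, and a <br>
toy <- minimal_html("<p>  Fuel   economy\n   standards<br>for vehicles  </p>")
p <- toy %>% html_element("p")

raw <- html_text(p)     # the text as stored in the HTML, whitespace included
clean <- html_text2(p)  # whitespace collapsed, <br> turned into a line break

raw
clean
```

For text that is going into a data frame, the cleaned version is usually the more convenient one.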
Below the policy text, we can see some tags in boxes.
The html for these looks like this
Each set of tags is contained in a <div> element
(which is a type of box) that has the class
o-policy-content__list. Let’s select these with a
CSS selector.
tag_boxes <- link_html %>% html_elements("div.o-policy-content__list")
tag_boxes
## {xml_nodeset (4)}
## [1] <div class="o-policy-content__list">\n ...
## [2] <div class="o-policy-content__list">\n ...
## [3] <div class="o-policy-content__list">\n ...
## [4] <div class="o-policy-content__list">\n ...
We’ll want to process these one by one, which we can do with a loop. For now, let’s just take the second one.
tag_box <- tag_boxes[2]
tag_box
## {xml_nodeset (1)}
## [1] <div class="o-policy-content__list">\n ...
If we look carefully at the HTML, we can see that within each
<div> there is a <span> with the
class o-policy-content-list__title. The text of that
element contains the type of tag. If we run html_element() on an
element we have already identified (instead of the whole HTML), then we
only search among the elements nested inside that element.
title_span <- tag_box %>% html_element("span.o-policy-content-list__title") # search within our tag_box for span elements with the class o-policy....
title <- html_text(title_span) # Get the text of that element
title
## [1] "Policy types"
Now we want to extract the tags themselves. These are contained in
<span> elements that have the class
a-tag__label.
tag_spans <- tag_box %>% html_elements("span.a-tag__label") # search within our tag_box for span elements with the class a-tag__label
tags <- html_text(tag_spans) # Get the text of those elements
tags
## [1] "Regulation"
## [2] "Energy efficiency / Fuel economy obligations"
## [3] "Performance-based policies"
If we want to put these into a dataframe, we will need a single text value. We can use paste() to join these together.
tag_string <- paste(tags, collapse="; ") # collapse joins the vector of strings into a single string, with "; " between the items
tag_string
## [1] "Regulation; Energy efficiency / Fuel economy obligations; Performance-based policies"
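The same steps can be applied to every tag box at once. The sketch below does this on a hand-written page (the HTML is invented, but uses the class names from above) and collects the result in a named list with one entry per box. The names titles, tag_strings and tag_list are introduced here for illustration.

```r
library(rvest)

# Made-up page with two tag boxes, using the class names seen on the policy page
toy <- minimal_html('
  <div class="o-policy-content__list">
    <span class="o-policy-content-list__title">Policy types</span>
    <span class="a-tag__label">Regulation</span>
    <span class="a-tag__label">Performance-based policies</span>
  </div>
  <div class="o-policy-content__list">
    <span class="o-policy-content-list__title">Sectors</span>
    <span class="a-tag__label">Road transport</span>
  </div>
')

tag_boxes <- toy %>% html_elements("div.o-policy-content__list")

# html_element() on a node set returns one result per box, so we get one title each
titles <- tag_boxes %>%
  html_element("span.o-policy-content-list__title") %>%
  html_text2()

# For each box, collect its tags and collapse them into a single string
tag_strings <- vapply(tag_boxes, function(box) {
  tags <- box %>% html_elements("span.a-tag__label") %>% html_text2()
  paste(tags, collapse = "; ")
}, character(1))

# Name each collapsed string after the title of its box
tag_list <- setNames(as.list(tag_strings), titles)
tag_list
```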
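Now we are ready to put it all together inside a loop. In this loop we will process each link in turn. The loop also reads the sidebar of each policy page, where each entry is a pair of <span> elements with the classes o-policy-aside-item__title and o-policy-aside-item__value. We’ll just take the first 10 links, so as not to bother the IEA too much.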
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df <- NULL # Let's initialise a null object which we will bind a new row to after each link
for (link in link_destinations[1:10]) {
  # First we'll initialise an empty list to store our values
  l <- list()
  # Then we create our link and parse it
  absolute_link <- paste0("https://iea.org", link)
  print("processing link")
  print(absolute_link)
  link_html <- read_html(absolute_link)
  # Now we get our text from the paragraph as before
  p <- link_html %>% html_element("div.m-block__content p")
  l$text <- html_text2(p) # and add it to our list of attributes
  # Now we process the sidebar
  keys <- link_html %>% html_elements("span.o-policy-aside-item__title") %>% html_text()
  values <- link_html %>% html_elements("span.o-policy-aside-item__value") %>% html_text()
  for (i in seq_along(keys)) { # Loop over the positions of our keys (seq_along() also copes with an empty vector)
    key <- keys[i] # Get the ith key in our vector of keys
    value <- values[i] # Get the ith value in our vector of values
    l[key] <- value # assign the value to the attribute that is named key
  }
  # Now we process the tags
  tag_boxes <- link_html %>% html_elements("div.o-policy-content__list")
  for (tag_box in tag_boxes) {
    title_span <- tag_box %>% html_element("span.o-policy-content-list__title") # search within our tag_box for the title span
    title <- html_text(title_span) # Get the text of that element
    tag_spans <- tag_box %>% html_elements("span.a-tag__label") # search within our tag_box for span elements with the class a-tag__label
    tags <- html_text(tag_spans) # Get the text of those elements
    tag_string <- paste(tags, collapse="; ") # collapse joins the vector of strings into a single string, with "; " between the items
    l[title] <- tag_string # We can add to the list we made earlier
  }
  # Turn the list into a one-row dataframe and add it to the rows collected so far
  df <- bind_rows(df, as.data.frame(l))
}
## [1] "processing link"
## [1] "https://iea.org/policies/11663-fuel-economy-standards-on-light-duty-vehicles"
## [1] "processing link"
## [1] "https://iea.org/policies/12654-emissions-limit-on-the-capacity-market-regulations"
## [1] "processing link"
## [1] "https://iea.org/policies/8506-gas-boilers-replacement-by-low-carbon-heating-systems"
## [1] "processing link"
## [1] "https://iea.org/policies/3124-local-government-fleet-renewal-mandate"
## [1] "processing link"
## [1] "https://iea.org/policies/12046-decommissioning-fossil-fuel-power-plants"
## [1] "processing link"
## [1] "https://iea.org/policies/8401-enhancements-to-minimum-energy-performance-standards-meps"
## [1] "processing link"
## [1] "https://iea.org/policies/12197-heavy-goods-vehicle-charge"
## [1] "processing link"
## [1] "https://iea.org/policies/11497-proposals-for-location-of-wind-power-turbines"
## [1] "processing link"
## [1] "https://iea.org/policies/13139-resolution-407152019-wholesale-energy-market-with-res-in-2023"
## [1] "processing link"
## [1] "https://iea.org/policies/11456-updated-meps-central-air-conditioners-and-heat-pumps"
df
## text
## 1 Japan sets and periodically updates fuel economy standards on cars, vans and trucks under its Top Runner Program. The efficiency requirements are based on the most fuel-efficient vehicles on the market, and manufacturers and importers of these vehicles are required to meet these vehicle efficiency standards on a corporate average basis. The fuel efficiency of passenger vehicles has improved by 96% over the past two decades. Japan has announced new fuel economy standards on light duty vehicles, aiming at improving fuel efficiency by 32% by 2030, compared with 2016 levels. The scope has been expanded to cover the efficiency of electric vehicles and plug-in hybrids, and new standards take into account the energy consumption of the fuel production (gasoline and electricity), the so-called 'well-to-wheel' approach.
## 2 The Capacity Market Regulations emissions limit aims to reduce the amount of CO2 emitted per unit of electricity. The Polish Electricity Networks made the amendment in view of adapting to EU regulations on the fulfilment of emissions limit for units participating in the capacity auctions, expected to start in July 2025. The limit is 550g carbon dioxide from fossil fuels per kWh of net electricity produced. Certification for the auction for 2025 was in September 2020.
## 3 Gas boilers in the UK will be replaced by low-carbon heating systems in all new homes built after 2025.
## 4 All new buses and coaches that shall be acquired for public transport services from 2025 onwards must be low-emission vehicles.
## 5 Within the Strategy for the environmental policy of the Slovak Republic, the Slovak government decided to reduce the use of coal for electricity generation and has adopted an action plan to achieve this goal.\n\nA pilot project for the Upper Nitra region has been developed with the support of the European Union
## 6 Want to know more about this policy ? Learn moreLearn more
## 7 As of 2023 the government intends to introduce a levy on truck traffic. This will be applied to dutch and foreign trucks of more than 3500 kg. based on the kilometer distance and roads used. The revenues will be used for innovation woards more sustainable road traffic. Relevant parties will be involved in decisions on re-investing the revenues.
## 8 These proposals consider the National Energy Independence Strategy and the National Energy and Climate plan to install a wind farm in the Baltic Sea. The wind farm would reach 700 MW and produce 2.5-3 TWh of electricity per year, which is 25% of Lithuania's electricity demand. It may take up to 8 years to install and the territory planned in the Baltic Sea covers an area of 137.5 square km, with an average wind speed of 9 m/s.
## 9 To reach a more eco-friendly energy matrix, the Ministry of Mines and Energy approved resolution 40715 that stipulates that, as of 2023, at least 10% of electricity purchases of wholesalers of the Wholesale Energy Market destined to serve end users must come from renewables (FNCER), through long-term contracts.
## 10 This policy applies to residential central air conditioners and heat pumps installed as part of a home's central heating and cooling system. Residential central air conditioners and heat pumps include split system central air conditioners and heat pumps; single package central air conditioners; single package heat pumps; small-duct high-velocity products; and space constrained products.
## Country Year Status Jurisdiction Topics
## 1 Japan 2030 Ended National Energy Efficiency
## 2 Poland 2025 Planned National <NA>
## 3 United Kingdom 2025 Planned National Energy Efficiency
## 4 France 2025 Planned National Energy Efficiency
## 5 Slovak Republic 2023 Planned National <NA>
## 6 Singapore 2023 Planned National Energy Efficiency
## 7 Netherlands 2023 Planned National Energy Efficiency
## 8 Lithuania 2023 Planned National Renewable Energy
## 9 Colombia 2023 Planned National Renewable Energy
## 10 United States 2023 Planned National Energy Efficiency
## Policy.types
## 1 Regulation; Energy efficiency / Fuel economy obligations; Performance-based policies
## 2 Strategic plans; Codes and standards; Nationally Determined Contribution; Targets, plans and framework legislation; Climate change strategies
## 3 Regulation; Other regulatory instruments
## 4 Regulation
## 5 Strategic plans; Prohibition; Technology bans / phase outs
## 6 Regulation; Codes and standards; Product-based MEPS; Minimum energy performance standards; Performance-based policies
## 7 Payments, finance and taxation; Taxes, fees and charges; Use and activity charges; Road usage charges
## 8 <NA>
## 9 Regulation; Mandatory energy management system; Prescriptive requirements and standards; Energy market regulation; Market design rules
## 10 Regulation; Codes and standards; Minimum energy performance standards
## Sectors
## 1 Road transport
## 2 Power, Heat and Utilities; Power generation; Electricity and heat generation
## 3 Buildings
## 4 Transport; Road transport; Passenger transport (Road); Mass road transit; Buses and minibuses - Local and urban service
## 5 Combined heat and power; Fuel processing and transformation; Coal secondary products production
## 6 Buildings; Residential; Services
## 7 Road transport
## 8 <NA>
## 9 Power, Heat and Utilities; Power transmission and distribution
## 10 Buildings; Residential
## Technologies
## 1 Transport technologies
## 2 <NA>
## 3 Space, water and process heating technologies; Domestic and building-scale boilers
## 4 Road vehicles; Buses and coaches; Drive train or engine; Battery electric; Plug-in hybrid; Transport technologies; Vehicle type
## 5 <NA>
## 6 Lighting technologies; Exterior lighting (incl. street); Light producing technologies; Incandescent; Compact fluorescent lamp; Light emitting diode (LED)
## 7 <NA>
## 8 <NA>
## 9 <NA>
## 10 Space cooling; Centralised AC system; Airconditioners (ACs); Heating, cooling and climate control technologies
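If the loop were extended to more links, one way to stay considerate towards the IEA's servers is to pause between requests. The sketch below is a hedged illustration rather than part of the walkthrough above: extract_text() and scrape_texts() are hypothetical helper names, and the one-second default pause is an arbitrary assumption. Only the extraction step is run here, on a hand-written page, so nothing is requested from the site.

```r
library(rvest)

# Pull the description out of one parsed policy page (same selector as above)
extract_text <- function(page) {
  page %>% html_element("div.m-block__content p") %>% html_text2()
}

# Hypothetical helper: fetch several pages, pausing between requests
scrape_texts <- function(urls, pause = 1) {
  texts <- character(length(urls))
  for (i in seq_along(urls)) {
    texts[i] <- extract_text(read_html(urls[i]))
    Sys.sleep(pause)  # wait before sending the next request
  }
  texts
}

# Offline check on a made-up page
toy_page <- minimal_html(
  '<div class="m-block__content"><p>A made-up policy description.</p></div>'
)
extract_text(toy_page)
```

Keeping the extraction in its own function also makes it easy to test a selector on a saved or hand-written page before running it against the real site.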